Demonstration of Over-fitting in Regression

Over-fitting a model means that the fit would not apply well to other data sets, and generalizing to other data is pretty much the whole point of fitting the model in the first place. When many variables are tested on a small sample, there is more potential to find statistical significance that does not hold up in other data. R2 is typically used to measure model fit for a regression, but there are methods to alter R2 to try to correct for any over-fitting that may be occurring. Adjusted R2 uses a formula based on the sample size and the number of predictors to adjust R2 downward toward a less biased estimate. There are also methods that iterate through the sample while leaving some of the data out to get an average R2. This site gives a good overview of methods like that, such as k-fold and leave-one-out cross-validation.
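To make the adjustment concrete: adjusted R2 is 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the sample size and p is the number of predictors. A minimal sketch (the data here are illustrative noise, unrelated to the simulation below) showing that the hand-computed value matches what `summary(lm())` reports:

```r
set.seed(1)
n <- 40
p <- 5

# Five noise predictors and a DV that is unrelated to them
x <- as.data.frame(matrix(rnorm(n * p), ncol = p))
x$y <- rnorm(n)

fit <- summary(lm(y ~ ., data = x))

# Adjusted R2 by the formula, using sample size n and predictor count p
manual <- 1 - (1 - fit$r.squared) * (n - 1) / (n - p - 1)

manual
fit$adj.r.squared  # same value as the manual calculation
```

Because the penalty grows with p relative to n, adjusted R2 can even go negative when a model with many predictors explains essentially nothing.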

Uncorrelated data were simulated to test how R2, adjusted R2, and R2 estimated using leave-one-out, K-fold, and repeated K-fold cross-validation performed. The simulation code is shown below.

library(caret)  #provides trainControl() and train() for cross-validation

r <- 100  #number of replications
varnum <- c(5, 10, 15, 20, 25, 30)
sampnum <- c(50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600,
             650, 700, 750, 800, 850, 900, 950, 1000)
out.all <- NULL #blank object where every row appended will be one simulation result

for (v in varnum) {
  for (s in sampnum) {
    for (i in 1:r) { #r is number of simulations/replications
      set.seed(200 + i) #seed so that others can run with the same data

      #simulate 50 uncorrelated predictors and a DV unrelated to them
      maxdataframe <- as.data.frame(matrix(rnorm(50000 * 50, 0, 1), ncol = 50))
      maxdataframe$DV <- rnorm(n = 50000, mean = 0, sd = 1)

      #keep the first s rows and first v predictors (the DV is column 51)
      subsetdata <- maxdataframe[c(1:s), c(1:v, 51)]

      #regular regression with r-square and adjusted r-square
      model <- lm(DV ~ ., data = subsetdata)
      summodel <- summary(model)

      rsquare <- summodel$r.squared
      adjrsquare <- summodel$adj.r.squared
      coefftable <- as.data.frame(summodel$coefficients)
      coefftable <- coefftable[-1, ] #drop the intercept row
      coefftable$sig <- ifelse(coefftable$`Pr(>|t|)` < .05, 1, 0)
      sumsig <- sum(coefftable$sig) #count of "significant" predictors

      #regression with leave-one-out cross-validation
      train.control1 <- trainControl(method = "LOOCV")
      model1 <- train(DV ~ ., data = subsetdata, method = "lm",
                      trControl = train.control1)
      LOOCVR2 <- model1$results$Rsquared

      #10-fold cross-validation
      train.control2 <- trainControl(method = "cv", number = 10)
      model2 <- train(DV ~ ., data = subsetdata, method = "lm",
                      trControl = train.control2)
      kfold10R2 <- model2$results$Rsquared

      #repeated 10-fold cross-validation (3 repeats)
      train.control3 <- trainControl(method = "repeatedcv",
                                     number = 10, repeats = 3)
      model3 <- train(DV ~ ., data = subsetdata, method = "lm",
                      trControl = train.control3)
      Rkfold10R2 <- model3$results$Rsquared

      out.one <- data.frame(i, v, s, rsquare, adjrsquare, sumsig,
                            LOOCVR2, kfold10R2, Rkfold10R2)
      out.all <- rbind(out.all, out.one)
    }
  }
}

Below is a plot showing the R2 measures by estimation method, number of variables, and sample size.
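One way to build a plot like that (a sketch, assuming the `out.all` data frame from the loop above plus the `ggplot2` and `tidyr` packages; the small `out.all` stand-in below exists only so the snippet runs on its own) is to average each R2 measure over the replications and then draw one line per estimation method:

```r
library(ggplot2)
library(tidyr)  #for pivot_longer()

# Illustrative stand-in for the simulation output; in the real analysis
# out.all comes from the simulation loop.
set.seed(1)
out.all <- data.frame(v = rep(c(5, 30), each = 20),
                      s = rep(seq(50, 1000, by = 50), times = 2),
                      rsquare = runif(40), adjrsquare = runif(40),
                      LOOCVR2 = runif(40), kfold10R2 = runif(40),
                      Rkfold10R2 = runif(40))

# Average each R2 measure within each variable-count / sample-size cell
means <- aggregate(cbind(rsquare, adjrsquare, LOOCVR2, kfold10R2, Rkfold10R2)
                   ~ v + s, data = out.all, FUN = mean)

# Reshape to long format so each estimation method becomes one plotted line
long <- pivot_longer(means, cols = -c(v, s),
                     names_to = "method", values_to = "R2")

ggplot(long, aes(x = s, y = R2, color = method)) +
  geom_line() +
  facet_wrap(~ v) +
  labs(x = "Sample size", y = "Mean R-squared")
```

Faceting by `v` gives one panel per number of predictors, which makes the low-variable cases discussed below easy to pick out.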

You would think that plain old R-Square would be the highest (and thus most over-estimating) value in all instances, and while that happens in most cases, when the number of variables is low, K-fold and repeated K-fold are actually as high or higher. All of the K-fold runs used 10 folds, so it could be that 10 folds was not adequate for those sample and variable sizes. Leave-one-out tends to do a good job of not overestimating model fit, and so does adjusted R-Square. Future analysis may look at how these measures handle correlated variables.